1 Keeping R and RStudio Current

Both R and RStudio are updated frequently, with some releases minor and some major. Update your versions of R and RStudio as needed. Remember to install the packages you plan to use, and to check for updates every few weeks. In case you have forgotten how, use the “Install Packages…” option under Tools to install packages, and check whether package updates are available by running “Check for Package Updates…” (also under Tools).

Let us go ahead and install some packages we will need.

packs = c("ggplot2", "ggmap", "plotly", "ggvis", "foreign", "haven", 
    "readxl", "plyr", "dplyr", "reshape2", "maps", "choroplethr", 
    "choroplethrMaps")
install.packages(packs, repos = "http://cran.rstudio.com")

library(devtools)
install_github("ramnathv/rCharts")

If the above packages installed fine then comment out the install.packages() commands by adding # before each install command. Otherwise R will reinstall these packages every time you run this file and that is a waste of time.
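
One way to avoid the reinstall problem altogether is to install only what is missing. The sketch below is an illustration rather than part of the original handout; the helper name missing_packages() is hypothetical, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Return the packages in `wanted` that are not in `installed`.
# missing_packages() is a made-up helper name for this illustration.
missing_packages = function(wanted, installed = rownames(installed.packages())) {
    wanted[!(wanted %in% installed)]
}

packs = c("ggplot2", "readxl", "dplyr")
to.install = missing_packages(packs)

# Uncomment to install whatever is missing:
# if (length(to.install) > 0) install.packages(to.install)
```

With this pattern the file can be re-run as often as you like without reinstalling anything.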

2 Reading Data

R can read data created in various formats (SPSS, SAS, Stata, Excel, CSV, TXT, etc). The most common data formats you will encounter are likely to be CSV or Excel files. Let us see how to read data in these formats by first downloading and saving the data available here (as a zip archive). Once this file downloads, double-click it and extract all four files to a new folder (title it Data) you create in your OU Box folder for the course.

  1. CSV & Tab-delimited Formats With the CSV format a comma separates each variable (a column), and each row in the original file represents an observation.

The first thing R will need to know is where your data reside. This can be accomplished either by setting the working directory or by explicitly specifying the path to your data. We will employ the second option for now.

df.csv = read.csv("~/Downloads/Archive/file1.csv", sep = ",", 
    header = TRUE)

df.csv is the name I have chosen to give to the data being read. I am telling R that it is in CSV format, where the file resides, the file-name, the fact that variables (one in each column) are separated by a comma (,), and the fact that the original data have column-headings (header=TRUE).

Note that when you create anything in R, you do so either via the = symbol or via the <- symbol. Thus df.csv = read.csv(…) is the same as df.csv <- read.csv(…) but my suggestion would be to stick with =.
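
A quick check of the two assignment operators, using made-up values. Keep in mind that <= (with an equals sign) is not an assignment at all; it is the less-than-or-equal comparison.

```r
a = 5      # assignment with =
b <- 5     # assignment with <-
identical(a, b)   # TRUE: both operators stored the same value

a <= 3     # FALSE: this compares a with 3, it does not assign anything
```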

When you execute the command you will see df.csv showing up under Data in the upper-right pane of RStudio. Click on df.csv and you can see the data.

A similar process works for reading in tab-delimited files where the columns are separated by a tab rather than by a comma.

df.tab = read.csv("~/Downloads/Archive/file1.txt", sep = "\t", 
    header = TRUE)

Note the one difference here: I have told R it is a tab-delimited file by specifying sep=“\t”
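
To try both readers without the course files, the sketch below writes a tiny made-up data frame to temporary files and reads it back. read.delim() is base R's shortcut for tab-delimited files (it is read.table() with a tab separator and header = TRUE as defaults); all object names here are illustrative.

```r
toy = data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Write the same data once as CSV and once as tab-delimited text
csv.path = tempfile(fileext = ".csv")
tab.path = tempfile(fileext = ".txt")
write.csv(toy, csv.path, row.names = FALSE)
write.table(toy, tab.path, sep = "\t", row.names = FALSE, quote = FALSE)

toy.csv = read.csv(csv.path, header = TRUE)
toy.tab = read.delim(tab.path)

str(toy.csv)                  # check the column types that R inferred
all.equal(toy.csv, toy.tab)   # both routes give the same data frame
```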

  2. Excel Format (.xls & .xlsx) There are several packages that will allow you to read files in various Excel formats but the one I prefer is readxl. Whenever we need to use a package we will have to first load it and then execute whatever commands call upon the loaded package’s features as shown below.

library(readxl)
df.xls = read_excel("~/Downloads/Archive/file1.xls")
df.xlsx = read_excel("~/Downloads/Archive/file2.xlsx")

2.1 Reading Files from the Web

It is also possible to specify the full web-path for a file.

fpe = read.table("http://data.princeton.edu/wws509/datasets/effort.dat")
test = read.table("http://www.ats.ucla.edu/stat/data/test.txt", 
    header = TRUE)
test.csv = read.csv("http://www.ats.ucla.edu/stat/data/test.csv", 
    header = TRUE)

library(foreign)
hsb2.spss = read.spss("http://www.ats.ucla.edu/stat/spss/webbooks/reg/hsb2.sav")
df.hsb2.spss = as.data.frame(hsb2.spss)

rm("hsb2.spss")  # Deleting the intermediate object
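
Web addresses change, so a remote read can fail through no fault of your code. A minimal sketch of a defensive wrapper follows; safe_read_csv() is a hypothetical helper, not a base R function, and it returns NULL instead of stopping when the file cannot be read. Local temporary paths stand in for web addresses so the block runs offline.

```r
# Hypothetical helper: try to read a CSV, return NULL on any failure
safe_read_csv = function(path, ...) {
    tryCatch(read.csv(path, ...),
        error = function(e) NULL,
        warning = function(w) NULL)
}

# A path that does not exist, standing in for a dead link
bad = safe_read_csv(file.path(tempdir(), "no-such-file.csv"))
is.null(bad)

# A file that does exist
ok.path = tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2), ok.path, row.names = FALSE)
ok = safe_read_csv(ok.path, header = TRUE)
```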

R is able to read data from Twitter feeds, buoys sitting in the Atlantic Ocean, and so much more!

3 Basic Data Operations in R

You can generate your own data, manipulate data by adding, subtracting, dividing, or multiplying, and convert numeric data to factors (qualitative variables), etc. We will see a few basic data operations at work below. Let us start by creating some data.

3.1 Creating A Small Data-Set

Let us create two variables, x and y.

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df = as.data.frame(cbind(x, y))

The commands above generate two columns, x and y, and then bind them as columns into a data-set called df. If we used rbind() instead it would bind x and y as rows instead of columns.

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(7, 8, 9, 10, 11, 12, 13)
df.rows = as.data.frame(rbind(x, y))

Note that when we use rbind() it names the columns V1, V2, and so on. Often we will want to label the columns differently from how they were read-in. This is easily accomplished:

names(df) = c("Variable 1", "Variable 2")
names(df.rows) = c("Variable 1", "Variable 2", "Variable 3", 
    "Variable 4", "Variable 5", "Variable 6", "Variable 7")

You can also generate data-sets that combine quantitative and qualitative variables. This is demonstrated below:

x = c(100, 101, 102, 103, 104, 105, 106)
y = c("Male", "Female", "Male", "Female", "Female", "Male", "Female")
df.1 = as.data.frame(cbind(x, y))

x = c(100, 101, 102, 103, 104, 105, 106)
y = c(0, 1, 0, 1, 1, 0, 1)
df.2 = as.data.frame(cbind(x, y))
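
A caveat about df.1: cbind() builds a matrix, and a matrix holds a single type, so combining numbers with text turns x into text as well. The sketch below (shortened vectors, illustrative object names) shows that data.frame() keeps each column's own type.

```r
x = c(100, 101, 102)
y = c("Male", "Female", "Male")

via.cbind = as.data.frame(cbind(x, y))   # matrix first, then data frame
via.df = data.frame(x, y)                # columns kept separate

is.numeric(via.cbind$x)   # FALSE: x was coerced along with y
is.numeric(via.df$x)      # TRUE: x is still numeric
```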

Note that in df.1 y is a string variable with values of Male/Female. In contrast, df.2 has y specified as a 0/1 variable, with 0=Male and 1=Female. We could label the 0/1 values in df.2 as follows:

df.2$y = factor(df.2$y, labels = c("Male", "Female"))

If you click the “play” button next to df.2 you will see the contents of the data-set. Note that x is shown as num (numeric) while y is shown as Factor with two levels, “Male” and “Female”.

We can operate on x as follows:

df.2$x1 = df.2$x * 10
df.2$x2 = df.2$x * 100
df.2$x3 = df.2$x/10
df.2$x4 = sqrt(df.2$x)
df.2$x5 = df.2$x^(2)
df.2$x6 = df.2$x * 1.31

Note the various operators; we multiply via *, divide via /, take the square-root via sqrt(), and so on.

4 Saving R Data

We can save a data-set we have created quite easily (see below):

save(df.2, file = "~/Downloads/Archive/df2.RData")

Note the sequence. We specify the data set we want to save, here df.2, and then the location and filename of the saved data: file=“~/Downloads/Archive/df2.RData”. If you look at the folder specified in the command you will see a file called df2.RData.
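
A round trip with a temporary file shows what load() gives back. This is a sketch with a made-up data frame; note that load() restores objects under their original names and returns those names.

```r
toy = data.frame(x = 1:3, y = c("a", "b", "c"))
toy.path = file.path(tempdir(), "toy.RData")

save(toy, file = toy.path)
rm(toy)                       # remove it from the workspace
restored = load(toy.path)     # brings `toy` back under its own name

restored                      # the names of the restored objects
```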

5 Loading and Modifying Data

Let us load a larger data-set, the hsb2 data we used last semester.

hsb2 = read.table("http://www.ats.ucla.edu/stat/r/modules/hsb2.csv", 
    header = TRUE, sep = ",")

Note that there are no labels for the various qualitative variables (female, race, ses, schtyp, and prog) so we’ll have to create these.

hsb2$female = factor(hsb2$female, labels = c("Male", "Female"))
hsb2$race = factor(hsb2$race, labels = c("Hispanic", "Asian", 
    "African American", "White"))
hsb2$ses = factor(hsb2$ses, labels = c("Low", "Middle", "High"))
hsb2$schtyp = factor(hsb2$schtyp, labels = c("Public", "Private"))
hsb2$prog = factor(hsb2$prog, labels = c("General", "Academic", 
    "Vocational"))

Having added labels to the factors in hsb2 we can now save the data for later use.

save(hsb2, file = "~/Downloads/Archive/hsb2.RData")
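
A note on the factor() calls above: factor() assigns labels to the sorted unique values, so the order of the labels must match the order of the codes. Spelling out the levels makes that mapping explicit. A small sketch with made-up codes:

```r
codes = c(2, 1, 3, 1)
ses.demo = factor(codes, levels = c(1, 2, 3), labels = c("Low", "Middle", "High"))

table(ses.demo)          # counts per label
as.character(ses.demo)   # each code replaced by its label
```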

6 Descriptive Statistics

We can now start looking at some descriptive statistics – the usual mean, median, minimum, maximum, standard deviation, variance, etc. We won’t go into the statistical theory underlying these estimates since we covered this last semester. Let us start easy, by loading some data and then seeing some of the functions that give us summaries of our data.

6.1 The summary() Function

Let us load the hsb2.RData we read and saved earlier. If we want to look at its contents and get a quick feel for the distributions of each variable we can do so via the summary() function.

load("~/Downloads/Archive/hsb2.RData")

summary(hsb2)
id female race ses schtyp prog read write math science socst
Min. : 1.00 Male : 91 Hispanic : 24 Low :47 Public :168 General : 45 Min. :28.00 Min. :31.00 Min. :33.00 Min. :26.00 Min. :26.00
1st Qu.: 50.75 Female:109 Asian : 11 Middle:95 Private: 32 Academic :105 1st Qu.:44.00 1st Qu.:45.75 1st Qu.:45.00 1st Qu.:44.00 1st Qu.:46.00
Median :100.50 NA African American: 20 High :58 NA Vocational: 50 Median :50.00 Median :54.00 Median :52.00 Median :53.00 Median :52.00
Mean :100.50 NA White :145 NA NA NA Mean :52.23 Mean :52.77 Mean :52.65 Mean :51.85 Mean :52.41
3rd Qu.:150.25 NA NA NA NA NA 3rd Qu.:60.00 3rd Qu.:60.00 3rd Qu.:59.00 3rd Qu.:58.00 3rd Qu.:61.00
Max. :200.00 NA NA NA NA NA Max. :76.00 Max. :67.00 Max. :75.00 Max. :74.00 Max. :71.00

Note how you see each variable along with some key statistics. You don’t see the standard deviation or variance listed for the numeric variables but these are easily calculated.

There is a tedious way of getting these estimates. Instead, we can rely on some R packages to obtain these values. Before we do so, however, let us look at some of the functions we will use quite often. The code below uses a generic data frame (df) and a generic variable (x). You will have to replace df by whatever you have called the data frame and replace x by the actual name of the variable.

  • mean(df$x) – the mean
  • sd(df$x) – the standard deviation
  • var(df$x) – the variance
  • min(df$x) – the minimum
  • max(df$x) – the maximum
  • quantile(df$x, c(0.25, 0.50, 0.75)) – the first quartile, the median, the third quartile
  • IQR(df$x) – the interquartile range
  • sum(df$x) – the total of the values of variable x
  • scale(df$x) – the z-score of variable x
  • cor(df$x1, df$x2) – the correlation between x1 and x2

sd(hsb2$read)
## [1] 10.25294
var(hsb2$read)
## [1] 105.1227
quantile(hsb2$math, c(0.25, 0.50, 0.75))
## 25% 50% 75% 
##  45  52  59
IQR(hsb2$math)
## [1] 14
cor(hsb2$math, hsb2$science)
## [1] 0.6307332
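
For reference, here is what a few of these functions compute, worked by hand on the built-in mtcars data so that the block runs without the course files. This is only an illustration of the formulas (the sample variance divides by n - 1).

```r
x = mtcars$mpg
n = length(x)

var.by.hand = sum((x - mean(x))^2) / (n - 1)   # sample variance
sd.by.hand = sqrt(var.by.hand)                 # standard deviation
z.by.hand = (x - mean(x)) / sd(x)              # z-scores

all.equal(var.by.hand, var(x))
all.equal(sd.by.hand, sd(x))
all.equal(z.by.hand, as.numeric(scale(x)))
```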

6.2 Using data.table

There are a number of ways that we could generate tables of descriptive statistics for numeric variables in a data frame. One of the more promising ways is via the data.table package. On the web you can find several examples of how to accomplish a particular task but for now we will focus on generating simple tables. These tables are aggregates of means, standard deviations, etc. The commands that follow also use two other packages – knitr and printr – to dress-up the tables. As such, you will see these tables created in two steps, first generating the table we want and giving it a name (table.1, table.2, etc) and then dressing each table via the kable() command.

library(data.table)
library(knitr)  # provides kable(), used to print the tables below
DT = data.table(hsb2)

table.1 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math))]

knitr::kable(table.1, digits = 2, booktabs = TRUE, caption = "Table 1: Descriptive Statistics")
Table 1: Descriptive Statistics
Mean.Reading SD.Reading Mean.Writing SD.Writing Mean.Math SD.Math
52.23 10.25 52.77 9.48 52.65 9.37

If we want these estimates for, say, Male versus Female students, and then perhaps for Male versus Female students in Public versus Private schools, we can do so via:

table.2 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = "female"]

kable(table.2, digits = 2, booktabs = TRUE, caption = "Table 2: Descriptive Statistics (by Gender)")
Table 2: Descriptive Statistics (by Gender)
female Mean.Reading SD.Reading Mean.Writing SD.Writing Mean.Math SD.Math
Male 52.82 10.51 50.12 10.31 52.95 9.66
Female 51.73 10.06 54.99 8.13 52.39 9.15
table.3 = DT[, list(Mean.Reading = mean(read), SD.Reading = sd(read), 
    Mean.Writing = mean(write), SD.Writing = sd(write), Mean.Math = mean(math), 
    SD.Math = sd(math)), by = list(female, schtyp)]

kable(table.3, digits = 2, booktabs = TRUE, caption = "Table 3: Descriptive Statistics (by Gender & School-type)")
Table 3: Descriptive Statistics (by Gender & School-type)
female schtyp Mean.Reading SD.Reading Mean.Writing SD.Writing Mean.Math SD.Math
Male Public 52.35 10.81 49.36 10.54 52.31 9.57
Female Public 51.42 10.12 54.69 8.41 52.19 9.37
Male Private 55.43 8.50 54.29 8.00 56.43 9.81
Female Private 53.33 9.85 56.50 6.54 53.44 8.13

data.table is also useful for collapsing a data frame by a select set of variables. This is often done when we want to calculate the mean or median for a specific group. Say, for example, we wanted the mean of all subject scores for Male versus Female students, or by Race. This could be achieved via:

table.4 = DT[, lapply(.SD, mean), by = female, .SDcols = c("read", 
    "write", "math", "science", "socst")]

table.5 = DT[, lapply(.SD, mean), by = race, .SDcols = c("read", 
    "write", "math", "science", "socst")]

kable(table.4, digits = 2, booktabs = TRUE, caption = "Table 4: Mean Scores (by Subject and Gender)")
Table 4: Mean Scores (by Subject and Gender)
female read write math science socst
Male 52.82 50.12 52.95 53.23 51.79
Female 51.73 54.99 52.39 50.70 52.92
kable(table.5, digits = 2, booktabs = TRUE, caption = "Table 5: Mean Scores (by Subject and Race)")
Table 5: Mean Scores (by Subject and Race)
race read write math science socst
White 53.92 54.06 53.97 54.20 53.68
African American 46.80 48.20 46.75 42.80 49.45
Hispanic 46.67 46.46 47.42 45.38 47.79
Asian 51.91 58.00 57.27 51.45 51.00
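
If data.table is not available, base R's aggregate() produces the same kind of grouped summary. A sketch on the built-in mtcars data (the choice of columns is illustrative):

```r
# Mean of mpg and hp within each value of cyl
means.by.cyl = aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
means.by.cyl
```

The formula reads "summarise mpg and hp by cyl", which plays the same role as the by argument in the data.table calls.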

7 Frequency Tables

With qualitative variables such as female, race, etc. we know we can best represent their distributions via frequency tables. These can be created very easily, and then dressed up a bit. The basic command for a cross-tabulation of frequencies is table(df$x1, df$x2). This command does not create the row and column totals so we use the as.data.frame(addmargins(tab.x, FUN = Total)) command. The kable() command generates the final table when we knit the document.

tab.6a = table(hsb2$ses)
Total = sum
tab.6b = as.data.frame(addmargins(tab.6a, FUN = Total))
colnames(tab.6b) = c("SES Category", "Frequency")
kable(tab.6b, booktabs = TRUE, caption = "Table 6: Frequency Table of SES")
Table 6: Frequency Table of SES
SES Category Frequency
Low 47
Middle 95
High 58
Total 200

Table 6 is a table of simple frequencies. What if we wanted a table of relative frequencies (as proportions or percentages)? We could build such a table as follows:

tab.6c = prop.table(tab.6a) * 100
tab.6d = as.data.frame(addmargins(tab.6c, FUN = Total, quiet = TRUE))
colnames(tab.6d) = c("SES Category", "Frequency")
kable(tab.6d, booktabs = TRUE, caption = "Table 7: Relative Frequency Table of SES")
Table 7: Relative Frequency Table of SES
SES Category Frequency
Low 23.5
Middle 47.5
High 29.0
Total 100.0

These are simple frequency/relative frequency tables of a single variable. What if we wanted to cross-tabulate ses and schtyp?

tab.6e = table(hsb2$ses, hsb2$schtyp)
tab.6f = addmargins(tab.6e, FUN = Total, quiet = TRUE)
kable(tab.6f, digits = 0, booktabs = TRUE, caption = "Table 8: Crosstabulation of SES & School-type")
Table 8: Crosstabulation of SES & School-type
Public Private Total
Low 45 2 47
Middle 76 19 95
High 47 11 58
Total 168 32 200

The result is a cross-tabulation of frequencies. We can turn this into a table of relative frequencies by modifying the code used earlier. In particular, note the use of 1 in the prop.table() command and then the use of 2 in the addmargins() command. The 1 in prop.table() converts each cell into a proportion of its row total. Multiplying by 100 turns these proportions into percentages, and the 2 in addmargins() adds a Total column that sums the percentages across each row, so every row total is 100.

tab.6g = prop.table(tab.6e, 1) * 100
tab.6h = addmargins(tab.6g, 2, FUN = Total, quiet = TRUE)
kable(tab.6h, digits = 2, booktabs = TRUE, caption = "Table 9: Crosstabulation of SES & School-type (Row Percentages)")
Table 9: Crosstabulation of SES & School-type (Row Percentages)
Public Private Total
Low 95.74 4.26 100
Middle 80.00 20.00 100
High 81.03 18.97 100

If we wanted column percentages then we would have to change things up a bit (see below):

tab.6i = prop.table(tab.6e, 2) * 100
tab.6j = addmargins(tab.6i, 1, FUN = Total, quiet = TRUE)
kable(tab.6j, digits = 2, booktabs = TRUE, caption = "Table 10: Crosstabulation of SES & School-type (Column Percentages)")
Table 10: Crosstabulation of SES & School-type (Column Percentages)
Public Private
Low 26.79 6.25
Middle 45.24 59.38
High 27.98 34.38
Total 100.00 100.00
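
To see what the margin argument does without loading hsb2, the sketch below re-enters the counts from Table 8 by hand and checks that row percentages sum to 100 across each row while column percentages sum to 100 down each column. The object names are illustrative.

```r
# Counts copied from Table 8 (matrix() fills column by column)
counts = as.table(matrix(c(45, 76, 47, 2, 19, 11), ncol = 2,
    dimnames = list(SES = c("Low", "Middle", "High"),
        School = c("Public", "Private"))))

row.pct = prop.table(counts, 1) * 100   # divide by row totals
col.pct = prop.table(counts, 2) * 100   # divide by column totals

rowSums(row.pct)   # each row adds to 100
colSums(col.pct)   # each column adds to 100
round(row.pct["Low", "Public"], 2)   # 95.74, as in Table 9
```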

This is a small snippet of some essential tables we might need to use but of course, as with all things R, this is but 1% of what R can do when it comes to constructing tables.

8 Graphics in R

Now we’ll see one of R’s premier packages in action when graphing data. Let us load the hsb2.RData we saved earlier.

load("~/Downloads/Archive/hsb2.RData")

ggplot2 is one of the leading R packages for graphics, followed closely by lattice. Let us work with ggplot2 first and draw some simple graphs. Note that there is extensive help available for ggplot2 on the web. You can start with the Cookbook for R or the ggplot2 documentation. You can also search on stackoverflow.

8.1 The Mechanics of ggplot2

ggplot2 uses the grammar of graphics to build graphs by breaking up each graph into three components – data, aesthetics, and geometry. You specify the data frame with the data argument, then the x and y coordinates with the aes() function, and finally the geometry (bar-chart, histogram, etc.) via one of the geom_ functions. The geometry for some of the graphs we will use most often is listed below:

  • geom_bar() – bar-chart
  • geom_histogram() – histogram
  • geom_line() – line chart
  • geom_point() – scatter plot
  • geom_density() – density plots
  • geom_jitter() – stripcharts

9 Constructing Graphs

Recall that for numeric variables we can rely on box-plots and histograms to explore the distribution of a numeric (scale) variable. Perhaps we are interested in reading scores and want to start with a histogram.

9.1 Histograms

library(ggplot2)
ggplot(data = hsb2, aes(x = read)) + geom_histogram()

You will see R telling you: stat_bin() using bins = 30. Pick better value with binwidth. That is, R is automatically grouping read into 30 groups. Maybe we want fewer groups, say 10. This can be done as follows:

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10)
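
The message mentions binwidth, which fixes the width of each group rather than the number of groups. A sketch on the built-in mtcars data (assuming ggplot2 is installed); layer_data() is used here only to inspect the bins that were drawn.

```r
library(ggplot2)

h = ggplot(data = mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2.5)
bins = layer_data(h)    # one row per bin, as computed by ggplot2

bins$xmax - bins$xmin   # every bin is 2.5 units wide
sum(bins$count)         # every car falls into exactly one bin
```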

We can customize this histogram further, changing the colors, the labels for the x-axis, the y-axis, adding a title, and so on.

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "cornflowerblue") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "salmon") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "deeppink1") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "yellowgreen") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency")

Note: A small snippet of the wide expanse of colors available in R can be seen here and you can always brew your own color palette (ask me and I’ll give you the code).

What if we wanted to construct these histograms for male versus female students, or for each of the SES groups?

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(~female)

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(~ses)

What if we wanted to break it out by female/male students in public versus private schools?

ggplot(data = hsb2, aes(x = read)) + geom_histogram(bins = 10, 
    fill = "tomato") + ggtitle("Histogram of Reading Scores") + 
    xlab("Reading Score") + ylab("Frequency") + facet_wrap(female ~ 
    schtyp)

9.2 Kernel Density Plots

ggplot(data = iris, aes(x = Sepal.Length, fill = Species)) + 
    geom_density(alpha = 0.3, trim = TRUE)

# after_stat(density) needs ggplot2 3.3.0 or later; older versions used ..density..
ggplot(data = iris, aes(x = Sepal.Length)) + geom_histogram(aes(y = after_stat(density)), 
    binwidth = 0.2, fill = "cornflowerblue") + ggtitle("Histogram & Kernel Density Plot of Sepal Length") + 
    xlab("Sepal Length") + ylab("Density") + geom_density(alpha = 1, 
    color = "tomato4", trim = TRUE) + facet_wrap(~Species)

When we construct a histogram we choose the bin-widths (i.e., how many groups we want and how wide each group should be). As a result, histograms are not smooth, and their shape depends on both the width of the bins and the end points of the bins. Kernel density plots get around two of these problems: they are smooth and do not depend on the end points of the bins (they do still depend on a bandwidth, the analogue of the bin-width). One can think of them as estimated probability density functions, similar to the curve of the standard normal distribution (z), that tell you how your data are distributed.
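
The idea that a kernel density plot behaves like a probability distribution can be checked numerically: the area under the curve should be about 1. A sketch using base R's density() on the built-in iris data, with a simple sum standing in for the integral:

```r
d = density(iris$Sepal.Length)   # default Gaussian kernel

step = diff(d$x)[1]              # the grid points are evenly spaced
area = sum(d$y) * step           # Riemann-sum approximation of the area
area
```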

9.3 Box-plots

Now we can revisit our old friends, the box-plots.

ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "seagreen2") + 
    ggtitle("Box-Plot of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read)) + geom_boxplot(fill = "peachpuff") + 
    ggtitle("Box-Plot of Reading Scores (by Gender & School Type)") + 
    xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)

9.4 Violin Plots

While box-plots are very useful for looking at the general shape of the distribution, violin plots tend to be more informative since they combine box-plots and kernel density plots. But not everyone likes these (or is used to them).

ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "seagreen2", 
    trim = FALSE, adjust = 0.5) + ggtitle("Violin Plots of Reading Scores") + 
    geom_boxplot(width = 0.1) + xlab("Gender") + ylab("Reading Score") + 
    coord_flip()

ggplot(data = hsb2, aes(x = female, y = read)) + geom_violin(fill = "peachpuff", 
    trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores (by Gender & School Type)") + 
    xlab("Gender") + ylab("Reading Score") + coord_flip() + facet_wrap(~schtyp)

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.25) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.5) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

ggplot(data = hsb2, aes(x = female, y = read, fill = schtyp)) + 
    geom_violin(trim = FALSE, adjust = 0.75) + geom_boxplot(width = 0.1) + 
    ggtitle("Violin Plots of Reading Scores") + xlab("Gender") + 
    ylab("Reading Score") + coord_flip()

9.5 Bar-Charts

Recall the bar-charts we used for qualitative variables last semester. Let us generate a few for gender, schtyp, prog, ses, and race.

ggplot(data = hsb2, aes(female)) + geom_bar(fill = "seagreen2", 
    width = 0.25) + ggtitle("Bar-Chart of Gender") + xlab("Gender") + 
    ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(~schtyp) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(ses ~ schtyp) + theme(axis.text.x = element_text(angle = 90, 
    hjust = 0))

ggplot(data = hsb2, aes(race)) + geom_bar(fill = "seagreen2") + 
    ggtitle("Bar-Chart of Race (by SES & School Type)") + xlab("Race") + 
    ylab("Frequency") + facet_wrap(ses ~ schtyp, ncol = 2) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 0))

Clearly some of these aren’t very helpful because we have too few students of a given racial/ethnic group. Let us look at a different rendition of a bar-chart.

ggplot(data = hsb2, aes(prog)) + geom_bar(aes(fill = ses)) + 
    ggtitle("Bar-Chart of Program Type (by SES & School Type)") + 
    xlab("Program Type") + ylab("Frequency") + facet_wrap(~schtyp, ncol = 2) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 0))

Use these with caution since they are useless unless the differences in the frequencies are very conspicuous and hence will jump out at the viewer.

9.6 Line Charts

If you have data over time then line charts are a good way to show trends. Run the code below (first making sure you have the plotly package installed) and see the result.

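library(plotly)
# Recent plotly versions expect columns as formulas (note the ~)
p = plot_ly(economics, x = ~date, y = ~uempmed, name = "unemployment", 
    type = "scatter", mode = "lines")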

plotly is a special graphics package for interactive graphics so don’t think this is how the typical line chart might look. For example, the same plot rendered via ggplot2 would look as follows:

ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line()

A little touch of magic via ggplot and the plotly package, and voila!!

p2 = ggplot(data = economics, aes(x = date, y = uempmed)) + geom_line(color = "red", 
    size = 0.25) + xlab("Date") + ylab("Median Duration of Unemployment (weeks)")
p2 = ggplotly(p2)

Regardless of the package-specific rendering, the basic point should be obvious: you can see how the median duration of unemployment (uempmed) varies over time. If you are interested, check out plotly’s capabilities here.

9.7 Scatter-plots

If we have two numeric (scale) variables then a scatter-plot is a great way to explore if and how these two variables are related.

ggplot(data = iris, aes(x = Sepal.Length, y = Petal.Width, color = Species)) + 
    geom_point()

ggplot(data = mtcars, aes(x = qsec, y = mpg, color = factor(cyl))) + 
    geom_point()

There is another package for interactive graphics, the ggvis package. The same plot as that rendered above can be drawn with ggvis (not shown below).

library(ggvis)
iris %>% ggvis(~Sepal.Length, ~Petal.Width) %>% layer_points(fill = ~Species)

More examples of ggvis graphics are available here.

Finally, let us close out scatter-plots by looking at the rCharts package, yet another way to generate interactive graphics (not shown below).

library(rCharts)
scatter.rcharts <- rPlot(Petal.Width ~ Sepal.Length, data = iris, 
    color = "Species", type = "point")
scatter.rcharts$print(include_assets = TRUE)

We can also build some bar-charts (see below):

hair_eye_male <- subset(as.data.frame(HairEyeColor), Sex == "Male")
n1 <- nPlot(Freq ~ Hair, group = "Eye", data = hair_eye_male, 
    type = "multiBarChart")
n1$print("iframesrc", cdn = FALSE, include_assets = TRUE)

More documentation on rCharts is available here.

9.8 Stripcharts

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter()

# fun replaces the older fun.y argument in ggplot2 3.3.0 and later
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter() + stat_summary(fun = median, geom = "point", 
    size = 3, color = "black")

# mean_sdl relies on the Hmisc package being installed
ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_jitter() + stat_summary(fun.data = "mean_sdl", geom = "pointrange", 
    size = 0.5, color = "black")

ggplot(data = iris, aes(y = Sepal.Length, x = Species, color = Species)) + 
    geom_boxplot() + geom_jitter()

10 A Teaser on Mapping with ggplot2 and ggmap

I’ll leave you with a few maps: first of the 48 contiguous states, then of all counties in those states, then of the counties in Ohio, and finally street maps of Athens, Ohio and Istanbul requested from OpenStreetMap (source = "osm") via ggmap.

library(maps)
library(ggmap)

states = map_data("state")

# guides(fill = "none") hides the legend; fill = FALSE is deprecated in recent ggplot2
ggplot() + geom_polygon(data = states, aes(x = long, y = lat, 
    group = group, fill = region)) + coord_fixed(1.3) + guides(fill = "none")

counties <- map_data("county")

ggplot() + geom_polygon(data = counties, aes(x = long, y = lat, 
    group = group, fill = region)) + coord_fixed(1.3) + guides(fill = "none")

ohio = subset(counties, region == "ohio")

ggplot() + geom_polygon(data = ohio, aes(x = long, y = lat, group = group, 
    fill = subregion)) + coord_fixed(1.3) + guides(fill = "none")

athens = get_map(location = "Athens, Ohio", zoom = 14, source = "osm")
ggmap(athens)

sultanahmet = get_map(location = "Istanbul, Turkey", zoom = 12, 
    source = "osm")
ggmap(sultanahmet)